Search results for "Data format"
Showing 10 of 10 documents
A comparison of HDFS compact data formats: Avro versus Parquet
2017
In this paper, compact file formats such as Avro and Parquet are compared with plain text formats to evaluate data query performance. Different data query patterns have been evaluated. Cloudera’s open-source Apache Hadoop distribution CDH 5.4 was chosen for the experiments presented in this article. The results show that the compact data formats (Avro and Parquet) take up less storage space than plain text formats because of their binary encoding and compression advantage. Furthermore, data queries against the column-based Parquet format are faster than against text formats and Avro.
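A minimal sketch of the kind of comparison the paper performs, assuming pandas, fastavro, and pyarrow are available; the toy schema and row count are illustrative, not the paper's workload:

    import os
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from fastavro import writer, parse_schema

    rows = [{"id": i, "value": i * 0.5} for i in range(100_000)]
    df = pd.DataFrame(rows)

    # Plain text baseline.
    df.to_csv("data.csv", index=False)

    # Avro: row-oriented binary format with an explicit schema.
    schema = parse_schema({
        "name": "Record", "type": "record",
        "fields": [{"name": "id", "type": "long"},
                   {"name": "value", "type": "double"}],
    })
    with open("data.avro", "wb") as out:
        writer(out, schema, rows, codec="deflate")

    # Parquet: column-oriented binary format with compression.
    pq.write_table(pa.Table.from_pandas(df), "data.parquet", compression="snappy")

    for path in ("data.csv", "data.avro", "data.parquet"):
        print(path, os.path.getsize(path), "bytes")

On data like this, the binary formats typically land well under the CSV size, and Parquet's columnar layout additionally favors column-selective queries.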
The problem of interoperability: A common data format for quantum chemistry codes
2007
A common format for quantum chemistry (QC), enhancing code interoperability and communication between different programs, has been designed and implemented. An XML-based format, QC-ML, is presented for representing quantities such as geometry, basis set, and so on, while an HDF5-based format is presented for the storage of large binary data files. Some preliminary applications that use the format have been implemented and are also described. This activity was carried out within the COST in Chemistry D23 project “MetaChem,” in the Working Group “A meta-laboratory for code integration in ab initio methods.” © 2007 Wiley Periodicals, Inc. Int J Quantum Chem, 2007
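A minimal sketch of the split the paper describes — small structured quantities in XML, large binary arrays in HDF5 — assuming h5py and NumPy; the element and dataset names are illustrative, not the actual QC-ML schema:

    import h5py
    import numpy as np
    import xml.etree.ElementTree as ET

    # XML side: small, human-readable structured quantities (geometry here).
    mol = ET.Element("molecule")
    ET.SubElement(mol, "atom", symbol="O", x="0.0", y="0.0", z="0.0")
    ET.ElementTree(mol).write("geometry.xml")

    # HDF5 side: large binary data, e.g. a four-index integral tensor.
    with h5py.File("integrals.h5", "w") as f:
        f.create_dataset("eri", data=np.random.rand(10, 10, 10, 10),
                         compression="gzip")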
Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems
2014
The ever-growing amount of data requires highly scalable storage solutions. The most flexible approach is to use storage pools that can be expanded and scaled down by adding or removing storage devices. To make this approach usable, it is necessary to provide a solution to locate data items in such a dynamic environment. This article presents and evaluates the Random Slicing strategy, which incorporates lessons learned from table-based, rule-based, and pseudo-randomized hashing strategies, providing a simple and efficient placement scheme that scales to exascale data. Random Slicing keeps a small table with information about previous storage system insert and remove operations…
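A simplified illustration of the interval-table idea behind Random Slicing, under the assumption that the hash space [0, 1) is cut into intervals each owned by one device; the interval reshaping performed on device insertions and removals is omitted:

    import bisect
    import hashlib

    # Small table of (interval_start, device) pairs covering [0, 1).
    intervals = [(0.0, "disk0"), (0.25, "disk1"), (0.5, "disk2"), (0.75, "disk3")]
    starts = [s for s, _ in intervals]

    def locate(key: str) -> str:
        """Map a key to the device owning the interval its hash falls into."""
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        x = h / 2**256  # uniform position in [0, 1)
        return intervals[bisect.bisect_right(starts, x) - 1][1]

    print(locate("block-42"))

Because lookup is a binary search over a small interval table, locating an item stays cheap no matter how many devices have come and gone.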
Energy and environmental benefits in public buildings as a result of retrofit actions
2011
The paper presents the results of an energy and environmental assessment of a set of retrofit actions implemented in the framework of the EU Project “BRITA in PuBs” (Bringing Retrofit Innovation to Application in Public Buildings – no: TREN/04/FP6EN/S07.31038/503135). Outcomes arise from a life cycle approach focused on the following issues: (i) construction materials and components used during retrofits; (ii) main components of conventional and renewable energy systems; (iii) impacts related to the building construction, for the different elements and the whole building. The results are presented according to the data format of the Environmental Product Declaration. Synthetic indi…
Workflow-Based Decision Support for Failure Mode and Effects Analysis
2010
To achieve high quality designs, processes, and services that meet or exceed industry standards, it is crucial to identify all potential failures throughout a system and work to minimize or prevent their occurrence or effects. This paper presents an innovative approach to Failure Mode and Effects Analysis (FMEA) that uses a Decision Support System (DSS) for supporting the FMEA processes. The DSS is powered by a workflow engine that guides the users through the processes by considering standard work templates or previous similar cases. It is also built as a framework for decision support tools so, besides its default one, different FMEA work instruments can be plugged in and used thr…
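The abstract does not detail the scoring the DSS applies, but FMEA workflows classically rank failure modes by the Risk Priority Number, RPN = severity × occurrence × detection; a sketch with made-up ratings:

    # Each failure mode is rated 1-10 on severity (S), occurrence (O),
    # and detection (D); the entries below are illustrative.
    failure_modes = [
        {"mode": "seal leak",      "S": 8, "O": 3, "D": 4},
        {"mode": "sensor drift",   "S": 5, "O": 6, "D": 7},
        {"mode": "connector wear", "S": 4, "O": 4, "D": 2},
    ]

    for fm in failure_modes:
        fm["RPN"] = fm["S"] * fm["O"] * fm["D"]

    # Highest-risk failure modes first.
    for fm in sorted(failure_modes, key=lambda fm: fm["RPN"], reverse=True):
        print(f'{fm["mode"]}: RPN={fm["RPN"]}')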
Large expert-curated database for benchmarking document similarity detection in biomedical literature search
2019
Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly deve…
OpenTIMS, TimsPy, and TimsR: Open and Easy Access to timsTOF Raw Data
2021
The Bruker timsTOF Pro is an instrument that couples trapped ion mobility spectrometry (TIMS) to high-resolution time-of-flight (TOF) mass spectrometry (MS). For proteomics, lipidomics, and metabolomics applications, the instrument is typically interfaced with a liquid chromatography (LC) system. The resulting LC-TIMS-MS data sets are, in general, several gigabytes in size and are stored in the proprietary Bruker Tims data format (TDF). The raw data can be accessed using proprietary binaries in C, C++, and Python on Windows and Linux operating systems. Here we introduce a suite of computer programs for data accession, including OpenTIMS, TimsR, and TimsPy. OpenTIMS is a C++ library capable …
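For context, the TDF layout pairs an SQLite metadata database (analysis.tdf) with a binary frame file (analysis.tdf_bin). Assuming that standard layout, the metadata side can already be inspected with the Python standard library; decoding the binary frames is what OpenTIMS, TimsPy, and TimsR provide on top:

    import sqlite3

    # Path is illustrative; analysis.tdf lives inside the Bruker .d folder.
    con = sqlite3.connect("sample.d/analysis.tdf")
    # Listing the tables shows which metadata (frame descriptions,
    # calibration, acquisition settings, ...) is queryable without
    # touching the proprietary binary file.
    for (name,) in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"):
        print(name)
    con.close()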
Lone Star Stack: Architecture of a Disk-Based Archival System
2014
The need for huge storage systems rises with the ever-growing creation of data. With growing capacities and shrinking prices, "write once read sometimes" workloads become more common. New data is constantly added, rarely updated or deleted, and every stored byte might be read at any time - a common pattern for digital archives or big data scenarios. We present the Lone Star Stack, a disk-based archival storage building block that is optimized for high reliability and energy efficiency. It provides a POSIX file system interface that uses flash-based storage for write-offloading and metadata, and the disk-based Lone Star RAID for user data storage. The RAID attempts to spin down disks a…
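A conceptual sketch of the write-offloading pattern described here: writes are absorbed by flash so the archival disks can stay spun down, then flushed in batches. The threshold and data structures are illustrative, not the Lone Star Stack implementation:

    class WriteOffloadStore:
        def __init__(self, flush_threshold: int = 4):
            self.flash_log = []          # staged (key, data) writes on flash
            self.disk = {}               # disk-backed archival store
            self.disks_spinning = False

        def write(self, key, data):
            self.flash_log.append((key, data))
            if len(self.flash_log) >= self.flush_threshold:
                self._flush()

        def _flush(self):
            self.disks_spinning = True   # spin up once per batch, not per write
            self.disk.update(self.flash_log)
            self.flash_log.clear()
            self.disks_spinning = False  # spin back down

    store = WriteOffloadStore()
    for i in range(5):
        store.write(f"object-{i}", b"payload")
    print(len(store.disk), "objects flushed,", len(store.flash_log), "still staged")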
LoneStar RAID
2016
The need for huge storage archives rises with the ever-growing creation of data. With today’s big data and data analytics applications, some of these huge archives become active in the sense that all stored data can be accessed at any time. Running and evolving these archives is a constant tradeoff between performance, capacity, and price. We present the LoneStar RAID, a disk-based storage architecture, which focuses on high reliability, low energy consumption, and cheap reads. It is designed for MAID systems with up to hundreds of disk drives per server and is optimized for “write once, read sometimes” workloads. We use dedicated data and parity disks, and export the data disks as individu…
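The dedicated data and parity disks follow the familiar single-parity construction, where the parity block is the XOR of the data blocks and any one lost block is recoverable; a sketch of that underlying idea (the actual LoneStar RAID layout may differ):

    from functools import reduce

    def xor_blocks(blocks):
        """XOR same-sized blocks byte by byte."""
        return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

    data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on three data disks
    parity = xor_blocks(data)            # block on the dedicated parity disk

    # Reconstruct a lost data block from the survivors plus parity.
    recovered = xor_blocks([data[0], data[2], parity])
    assert recovered == data[1]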
Code Interoperability and Standard Data Formats in Quantum Chemistry and Quantum Dynamics: The Q5/Q5cost Data Model
2014
Code interoperability and the search for domain-specific standard data formats represent critical issues in many areas of computational science. The advent of novel computing infrastructures such as computational grids and clouds makes these issues even more urgent. The design and implementation of a common data format for quantum chemistry (QC) and quantum dynamics (QD) computer programs is discussed with reference to the research performed in the course of two Collaboration in Science and Technology Actions. The specific data models adopted, Q5Cost and D5Cost, are shown to work for a number of interoperating codes, regardless of the type and amount of information (small or large datasets) …